Abstract

Objectives This study aims to gain insight into what con- 
stitutes the drug community in the Russian Federation; in- 
formation that is absent in official governmental data but is 
vital for developing effective and much needed intervention 
strategies to counter the on-going 'drug epidemic'. 
Methods Members of the on-line drug community are identified from a crawled set of almost 100,000 users from the social network 'LiveJournal' by context-sensitive text mining of the users' blogs using a dictionary of known drug-related official and 'slang' terminology. The interests that are more (or less) common within this sub-community are determined using Fisher's exact tests and Hochberg and Benjamini's false discovery rate control procedure. A 'psychological portrait' of the 'average' Russian drug user is created by clustering these indicative interests. In addition, a naive Bayesian classifier is presented for assessing one's susceptibility to the 'drug virus'.

Results A total of 268 significant interests that separate the users who most actively spread information on narcotics from the rest of the network, and a set of themes summarizing these interests. Three sub-networks of users which can be uniquely classified as being either 'infectious', 'susceptible' or 'immune' to the 'drug virus'.
Conclusions The 'average' drug user in the Russian Feder- 
ation is generally more interested in topics such as Russian 
rock, non-traditional medicine, UFOs, Buddhism, yoga and 
the occult. The three sub-networks are all scale-free. The 
presented method seems to be fruitful for assessing opaque 
communities within society. 

Keywords Illicit drug use • Social network • LiveJournal • 
Power-law • Russian Federation 



1 Introduction 

Since the fall of the Soviet Union in the early nineties, drug abuse has seen a dramatic increase in the Russian Federation. From 1990 to 2001 the numbers of registered drug addicts and drug-related crimes went up nine- and fifteen-fold, respectively (Sunami, 2007), and continued to rise over the last decade (Mityagin, 2012). The rapid spread and extent of this 'drug epidemic' is of immediate concern to the Russian government, and finding effective ways to halt this trend is considered to be of utmost importance.

Due to the criminal nature and general social disapproval of drug use, it is complicated to assess the drug community directly. Official governmental statistics do provide an insight into the general trend, but only manage to scratch the surface of the entire drug community in the Russian Federation. The drug users registered in their databases are often among the extreme cases: they have been in one (or more) rehabilitation programs or were arrested for using and/or selling illicit narcotics. The (still) 'moderate' user stays out of the picture, making it difficult to obtain reliable information on the drug community as a whole. Within criminological research this non-registered crime is often referred to as the dark number, see Coleman and Moynihan (1996), and Rhodes et al. (2006).

Gaining a better understanding of what constitutes the drug community in the Russian Federation and in which ways its members can influence (or even inspire) others to start using might prove valuable for devising more effective intervention strategies that can turn the current situation for the better.

In order to handle the drug society's inherent complex- 
ity, we will partition the Russian population into (roughly) 
three groups varying in their involvement in illicit drug use: 

1. The immune: the group of people that because of, for 
example, social commitments (e.g., marriage, children, 
job) and/or strongly held (religious) convictions will not 
be persuaded to start using drugs. 

2. The infectious, i.e., the drug community: the group con- 
sisting of all individuals involved with drug abuse in one 
way or another (i.e., using, selling or producing). 

3. The susceptible: all individuals that are not a member of one of the previously mentioned groups. They are not involved in any way with illicit drug use at the moment, but might, due to their social position and environment, be drawn toward drug use in the future.

The idea to divide the population into these three groups was inspired by the division often used in models for virus spread, see for example the SIR-model of Daley and Kendall (1964), since a similar process seems to underlie the spread of drug addiction through society: infectious (drug users/dealers) can infect susceptible others with the (drug) virus by means of direct and personal contact (i.e., sharing or selling drugs). This analogy has been made before, not only between virus spread and drug addiction (Agar, 2005; Beenstock and Rahav, 2004; Mityagin, 2012), but also in the field of 'obesity spreading' (Gallos et al., 2012) and for modeling the spread of information (Iribarren and Moro, 2009; Onnela et al., 2007; Bernardes et al., 2012).

Social network sites (SNSs) have proved over the years 
that they provide means to uncover social structures and pro- 
cesses that were difficult to observe before (Scott, 2011). 
In this paper we investigate the social network site Live- 
Journal 1 . With approximately 2.6 million registered Rus- 
sian users and over 39 million registered users worldwide, 
it is one of the largest and most popular SNSs in the Rus- 
sian Federation. The site offers its users an easy-to-use blog- 
platform where people can read and share their articles with 

1 LiveJournal is available at http://www.livejournal.com (English) and http://www.livejournal.ru (Russian).



others. In contrast to micro-blogging SNSs such as Facebook 2 (Wilson et al., 2012; Ferri et al., 2012) or Twitter 3, which are often mentioned in the literature, the site offers a tremendous amount of long user-written texts, making it extremely suitable for text mining and, consequently, a unique source of data. Perhaps because they have the impression of being among 'friends', LiveJournal users sometimes write quite openly about their personal lives in their blogs. Some even comment on their use of drugs and their experiences with various kinds of narcotics. Others (the extreme cases) describe the production process in detail. This on-line openness can be ascribed to the on-line disinhibition effect (Suler, 2004): the invisible and anonymous qualities of on-line interaction lead to disinhibited, more intensive, self-disclosing and aggressive uses of language. Furthermore, recent studies show that criminal organizations are actively using on-line communities as a new 'business' tool for communication, research, logistics, marketing, recruitment, distribution of drugs and monetization (Decary-Hetu and Morselli, 2011; Europol, 2011; Walsh, 2011; Choo and Smith, 2008; Williams, 2001). Research on on-line communities, therefore, might aid in gaining a better understanding of the behavior of opaque networks within a society.

In order to get a better insight into the drug community 
in the Russian Federation, we crawl a large randomly se- 
lected group of Russian LiveJournal users. Every blog en- 
try of every user is associated with a weight indicating to 
what extent it refers to illicit forms of drug use by overlaying 
the document word-for-word with a dictionary consisting of 
known drug-related terminology (both official as well as informal/'slang'). When the sum of 'indicator' weights of all
the blog entries of a specific user reaches a certain threshold, 
the user is considered to be a member of the on-line drug 
community. The idea behind this approach is that drug users 
are more likely to use drug-related terminology in their blog 
entries than others. We will return to this assumption ex- 
tensively in Section 5. The way users are classified and the 
drug-dictionary are discussed in detail in Section 3.2. 

After identifying the on-line drug community, we might 
ask ourselves what kind of people are generally to be found 
in this sub-network? In order to get a better picture of the 
'average' user in this sub-community, we gather all the in- 
terests mentioned on each user's profile page and compare 
how often they appear within the on-line drug community 
with the frequency of appearance in the rest of the network. 
We limit ourselves here to interests, due to the fact that it is 
rather unclear how to automatically construct a 'psycholog- 
ical profile' of a user based solely on his or her texts. That 
way, we try to isolate those interests that are truly more common in one of these two distinct groups of users. In Section 3.3 we describe the methodology used in more detail.

2 Facebook is available at http://www.facebook.com.

3 Twitter is available at http://www.twitter.com.

The susceptibility of people to the 'drug virus' is thought
to depend on their exposure to drug-related information and 
their own interest in this topic. This social mechanism of 
transmission is called differential association in which drives, 
techniques, motives, rationalizations and attitudes toward de- 
viant behavior are learned and exchanged by social inter- 
action (Sutherland, 1947; Lanier and Henry, 1998; Haynie, 
2002). From this perspective the number of interests a user has in common with the on-line drug community might indicate a higher susceptibility, since 1) this person is more likely to stumble upon blog entries published by members of the on-line drug community (which are more often about drug use), and 2) it might indicate a certain lifestyle more prone to drugs. Following this reasoning, we present a naive Bayesian classifier using the log-likelihood ratio method (Kantardzic, 2011; Hastie et al., 2009) in Section 3.4 that assesses the susceptibility of a user to drugs given his/her personal interests. When a user's interests overlap more with the interests in the on-line drug community than with the interests of the rest of the population, the user is considered to be susceptible.

Users that were identified neither as members of the on-line drug community on the basis of their written texts nor as susceptible due to a large similarity between their interests and the interests common in the on-line drug community are considered to be immune. They do not write (much) about illicit drug use and their interests do not suggest a leaning towards the on-line drug community.

After having (roughly) identified the three subgroups (i.e., 
immune, infectious and susceptible) in the social network 
LiveJournal, we might wonder whether there are structural 
differences between the corresponding subnetworks. In Sec- 
tion 4.3 we will describe and compare them. 

The remainder of this paper is organized as follows. In 
Section 2 we discuss the social network site LiveJournal, 
describe the kind of information users put out about them- 
selves and point to several unique features this SNS has over 
others often studied in the literature. Section 3 describes the 
crawled LiveJournal data set and the methods used to parti- 
tion its users and determine significant interests. The results 
are presented in Section 4. We will finish with several con- 
clusions, a rather extensive discussion and a few pointers 
for future research. In Appendix 1 we explore the frequency 
with which interests appear in the network and show that 
this probability distribution follows a power-law. 

2 The SNS LiveJournal 

The social network site LiveJournal, with over 39 million registered users worldwide and approximately 2.6 million registered Russian users, is by far the most popular blog platform in the Russian Federation. With 1.7 million active users and (approximately) 130,000 new posts every day, the site offers a vast body of data for studying social structures and processes 4.
In addition to publishing their own articles, the users are 
offered the possibility to enter information on their where- 
abouts (e.g. hometown), demographics (e.g. birthday), their 
personal interests (e.g. favourite books, films and music) and 
even their current mood (e.g. happy, sad). Articles can be 
tagged and an extensive comment system provides the read- 
ers with the possibility to respond and exchange opinions 
and ideas. 

Users can unilaterally declare any other registered user 
as a 'friend', i.e., ties are unidirectional. A tie reflects the de- 
sire of a user to keep up-to-date with the articles of the other. 
Consequently, every profile contains two lists of ties: 1) a 
list of alters that currently follow the articles published by 
the ego, and 2) a list of alters whose articles the ego follows. 
(Note the similarity with Twitter). We will refer to these lists 
as the list of followers and following friends, respectively. 

LiveJournal differs from other (large) social network sites 
in two important aspects: 1) it has a large number of users 
that actively write in Russian, and 2) the texts are large in 
contrast to the micro-blogging SNSs often considered in the 
literature (Wilson et al., 2012). The latter makes LiveJour- 
nal exceptionally suitable for text-mining and, as such might 
provide insights into social structures and processes where 
other SNSs cannot. 

3 Methods 

Section 3.1 describes the data collected from the SNS Live- 
Journal. In Section 3.2 we discuss the drug-dictionary and 
procedure used for classifying those users who are most 
likely to be involved in drug abuse. After colouring the subnetwork of the on-line drug community, we proceed in Section 3.3 with identifying those interests that are more common for this set of users or for the rest of the on-line population. These indicative interests are used by the naive Bayesian classifier introduced in Section 3.4 for identifying the 'susceptible' and 'immune' subnetworks. We analyze the structure of these three subnetworks in Section 4.3.

3.1 The LiveJournal Data Set 

On the 9th of September 2012 we crawled 98602 randomly selected Russian user profiles. For each profile we stored its username, the last 25 posted blog entries, personal interests and the lists of followers and following 'friends'. In addition, we stored (when available) the user's birthday and place of residence.

In order to collect this data, we developed a distributed crawler that employs the MapReduce model (Lammel, 2007) and the open source framework Apache Hadoop (White, 2009). The system is similar to the Apache Nutch crawler (Cafarella and Cutting, 2004) but allows for multiple users to collect and process data at the same time; the fetcher module is moved outside the Hadoop framework, making it a separate application that can run on various machine architectures simultaneously.

4 LiveJournal's own statistics page can be found at http://www.livejournal.com/stats.bml.

A total of 22357 users fully specified their birthday on their public profile (ages higher than 80 were regarded as falsely reported). In Section 4.1 we explore some characteristics of the crawled population and compare it with the Russian population.

3.2 The On-line Drug Community 

Users are classified as being a member of the on-line drug community by comparing their last 25 blog entries with a dictionary of known drug-related terminology collected by drug experts at the Saint Petersburg Information and Analytical Center 5. The total of 368 words in this dictionary are split up into two categories: official and informal/'slang' terminology. Official terminology consists of words that are unmistakably related to illicit drug use (e.g., cocaine and heroin); these are assigned a high weight, i.e., 5. Informal/'slang' expressions can often be interpreted in various ways and cannot be directly related to drug use. For example, the Russian word 'kolesa' normally refers to wheels, while it can also be used (in rather dubious circumstances) as a word for pills. To account for this ambiguity, 'slang' expressions are assigned a lower weight than official terminology, i.e., 1. Table 1 shows a few example words from the dictionary alongside their weight and (free) English translation 6.

In addition to this set of words, each blog entry was also checked for the presence of a collection of drug-related phrases. The presence of certain combinations of words in a text, e.g., 'injecting' and 'heroin', is a strong indication that the author is involved with illicit drug use. In order to account for this valuable information, the dictionary additionally contains 8359 phrases, each assigned a slightly higher weight than the mere sum of the weights of the words it consists of 7.

In order to compare inflected or derived words in the 
posts with words in the dictionary we first reduce them to 
their root form using a Russian version of the Porter stem- 
ming algorithm (Porter, 1980; Porter, 2006). 

5 The homepage of SPb IAC can be found at http://iac.spb.ru (in Russian).

6 The full drug-dictionary is freely available and can be downloaded at http://escience.ifmo.ru/?ws=sub48.

7 The number of phrases (8359) is rather high in comparison to the 
number of words (368) in this dictionary. This is due to the fact that 
we consider a phrase consisting, for example, of the words 'injecting', 
'heroin' and the phrase with the words 'injection', 'heroin' and 'nee- 
dle' as two separate expressions (where the latter is associated with a 
higher weight than the former). 



When the summed weights of all the blog entries of a user reach a certain threshold, he/she is considered to be a member of the on-line drug community. Users who use a small number of the words and phrases from the dictionary in a limited number of blog entries are, thus, less likely to be identified as a member than the ones who frequently use drug-related terminology throughout a large number of texts. The threshold was set manually, see Fig. 1.
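The weighting scheme described in this section can be sketched as follows. This is a minimal illustration and not the authors' implementation: the miniature dictionary, the phrase entry, the way a phrase bonus is added on top of the word weights, and the use of "greater than or equal to" for reaching the threshold are all assumptions, and stemming is omitted.

```python
import re

# Hypothetical miniature dictionary (the real one holds 368 words).
WORD_WEIGHTS = {"kokain": 5, "geroin": 5, "kolesa": 1, "tabletki": 1}
# Assumption: a phrase is a set of words that co-occur in one entry and adds
# a bonus, so its total weight slightly exceeds the sum of its words.
PHRASE_BONUS = {frozenset({"geroin", "kolesa"}): 1}
THRESHOLD = 8  # threshold value reported with Fig. 1


def entry_weight(text):
    """Sum the weights of dictionary words and phrases found in one entry."""
    tokens = re.findall(r"\w+", text.lower())
    weight = sum(WORD_WEIGHTS.get(tok, 0) for tok in tokens)
    present = set(tokens)
    weight += sum(b for phrase, b in PHRASE_BONUS.items() if phrase <= present)
    return weight


def is_drug_community_member(entries, threshold=THRESHOLD):
    """True when the summed entry weights reach the threshold (assumed >=)."""
    return sum(entry_weight(e) for e in entries) >= threshold
```

For example, a user whose entries contain one official term and three ambiguous 'slang' terms would have a summed weight of 8 under these assumptions and would be flagged.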

We will refer to the entire set of users whose summed weights reach the threshold as the on-line drug community throughout the rest of this paper. To what extent this sub-community corresponds to the Russian drug community will be a point of discussion in Section 5.



Table 1 Examples of words in the drug-dictionary

Russian      English translation    Weight
Kokain       Cocaine                5
Geroin       Heroin                 5
Mariguana    Marijuana              5
Abstyag      Withdrawal syndrome    5
Tabletki     Pills                  1
Kolesa       Pills/Wheels           1




3.3 Identifying Common Interests of the On-line Drug 
Community 

In this section we will formulate an approach for determin- 
ing which interests are most common (or uncommon) for 
a particular subset of SNS users, in our particular case, the 
on-line drug community. 

First, we collect those interests that appear more than 10 times on the profile pages of the users in the on-line drug community. (The reason for disregarding rather infrequent interests is that they do not add much when one wants to gain a better understanding of an entire community.) Let us denote this set of interests by $I = \{I_1, I_2, \ldots, I_m\}$. Since the members of the on-line drug community are known, we are able to count how often users express their interest in both this sub-community and the rest of the social network.
For every interest $I_i$ we can, thus, obtain a 2 × 2 contingency table similar to Table 2, where $a + b + c + d = n$ is the total number of users in the crawled population that have at least one interest on their profile page (i.e., $n = 62370$), $a + c$ is the number of users identified as members of the on-line drug community, $a + b$ is the total number of users who expressed their interest in $I_i$, and $c + d$ is the number of users not interested in $I_i$. The question is whether this interest appears significantly more (or less) often in the on-line drug community than in the rest of the network, i.e., do the proportions $a/(a + c)$ and $b/(b + d)$ differ?

Table 2 The 2 × 2 contingency table for interest $I_i$

                           Drug community    Rest     Total
Is interested in $I_i$     a                 b        a + b
Not interested in $I_i$    c                 d        c + d
Total                      a + c             b + d    n


We, thus, have $m$ null hypotheses ($H_0^i$), one for each interest $I_i$ in $I$. Applying the two-sided version of Fisher's exact test 8 (Fisher, 1922; Agresti, 1992) to each contingency table provides us with their corresponding p-values: $p_1, p_2, \ldots, p_m$.
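As an illustration of the test applied to a single contingency table, the sketch below computes a two-sided p-value directly from the hypergeometric distribution using only the standard library. It is not the authors' code; the function name and the convention of summing all tables that are at most as probable as the observed one are assumptions (this is a common definition of the two-sided test). For tables as large as the ones in this study, a library routine such as `scipy.stats.fisher_exact` would be the more practical choice.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    total = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(n - row1, col1 - x) / total

    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables that are at most as probable as the observed one;
    # the small tolerance guards against floating-point ties.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)
```

Here `a`, `b`, `c` and `d` follow the cell labels of Table 2.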

The total number of null hypotheses is large (3282 to be precise, corresponding to the total number of interests expressed more than 10 times in the on-line drug community). Simply comparing the obtained p-values with a common fixed significance level (e.g., $p < .05$) will result in a high number of false discoveries, i.e., falsely rejected null hypotheses. Benjamini and Hochberg (1995) showed that the expected false discovery rate can be upper bounded by $q \in [0, 1]$ with the following control procedure 9 (Benjamini and Hochberg, 1995; Benjamini and Yekutieli, 2001):

1. Order the p-values in increasing order, i.e., $p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}$.

2. For a given $q$, find the largest $k$ for which $p_{(k)} \le \frac{k}{m} q$.

3. Reject all $H_0^{(i)}$ for $i = 1, 2, \ldots, k$.

We will use a q-value of 5%. The interests associated with all rejected $H_0^{(i)}$, $I' = \{I_{(1)}, I_{(2)}, \ldots, I_{(k)}\}$, are considered to be the interests that really differ between the on-line drug community and the rest of the social network.
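The three steps of the control procedure can be sketched as below. This is a minimal illustration of the standard Benjamini-Hochberg step-up rule, with the bound $\frac{k}{m} q$ on the $k$-th smallest p-value; the function name and the choice to return the indices of the rejected hypotheses are assumptions made for the example.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of the hypotheses rejected by the step-up rule."""
    m = len(p_values)
    # Step 1: order the hypotheses by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step 2: find the largest rank k with p_(k) <= (k / m) * q.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k = rank
    # Step 3: reject the k hypotheses with the smallest p-values.
    return sorted(order[:k])
```

Note that the rule is "step-up": a p-value that misses its own bound is still rejected when a larger p-value further down the ordering meets its bound, as in `benjamini_hochberg([0.02, 0.03, 0.035, 0.9])`, which rejects the first three hypotheses.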

Due to the large sample size and the initially large number of interests, the number of significant interests in $I'$ is expected to be quite high. Partitioning them into a set of themes might help with getting a better overview of the wide variety of significant interests. In order to do so, we cluster the set of significant interests $I'$ using a hierarchical agglomerative clustering algorithm with a complete-linkage strategy (Kantardzic, 2011; Everitt, 2001). Complete linkage is preferred here over single linkage because it does not suffer from the chaining phenomenon, i.e., clusters being forced together due to single elements being close to each other, even if a majority of elements is very distant. Average linkage was not an option due to its high computational load.



8 A $\chi^2$ test originally designed for 2 × 2 contingency tables by Sir R.A. Fisher (1922).

9 Strictly speaking, the expected false discovery rate is only upper 
bounded when the m test statistics are independent, which does not 
hold in this particular case. B. Efron makes the case in his book Large- 
Scale Inference (2010) that this independency constraint is not strong. 



The similarity between two clusters of interests, $C_1$ and $C_2$, is defined as

$$\mathrm{sim}(C_1, C_2) = \frac{n(S_1 \cap S_2)}{\sqrt{n(S_1) \cdot n(S_2)}} \qquad (1)$$

where $S_1$ and $S_2$ are the sets of users that expressed their interest in at least one of the topics in, respectively, $C_1$ and $C_2$, and $n(\cdot)$ returns the number of users. This similarity measure is known as cosine similarity or, more commonly in biology, as the Ochiai coefficient (Ochiai, 1957). We will refer to the resulting clusters of significant interests as themes throughout the rest of this paper.
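The similarity of Eq. (1) can be computed directly from sets of user identifiers. The sketch below is only an illustration: the mapping `users_by_interest`, the function names and the decision to return 0 for an empty user set are assumptions, and the agglomerative clustering loop itself is not shown.

```python
from math import sqrt


def users_of(cluster, users_by_interest):
    """Union of the users that list at least one interest in the cluster."""
    users = set()
    for interest in cluster:
        users |= users_by_interest.get(interest, set())
    return users


def ochiai(cluster_1, cluster_2, users_by_interest):
    """Ochiai (cosine) similarity between two clusters of interests."""
    s1 = users_of(cluster_1, users_by_interest)
    s2 = users_of(cluster_2, users_by_interest)
    if not s1 or not s2:
        return 0.0  # assumption: clusters without users are dissimilar
    return len(s1 & s2) / sqrt(len(s1) * len(s2))
```

With hypothetical data in which four users list 'yoga', four list 'buddhism' and two list both, the similarity of the two single-interest clusters is 2 / sqrt(4 · 4) = 0.5.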

3.4 Assessing Susceptibility 

A large number of common interests between a user and the 
on-line drug community might indicate a higher susceptibil- 
ity to drugs, since 1) the user is more likely to stumble upon 
blog entries published by members of this sub-community, 
and 2) it might indicate a certain lifestyle more prone to drug 
use. Certain interests might, on the other hand, indicate a low susceptibility. Think of interests that suggest that the user in question has certain social commitments (e.g., marriage, children, job) or strongly held (religious) convictions. The idea that interests are related to susceptibility underlies the classification method in this section: an individual is considered to be a susceptible user when his/her personal interests resemble the interests common for the drug community more than the interests of the rest of the on-line population.

A naive Bayesian classifier was used (Kantardzic, 2011). Due to the fact that certain combinations of interests are rare, we are forced to assume conditional independence between each pair of interests and use the log-likelihood ratio method.

Let us first define $k$ feature variables, one for each interest in the set $I'$:

$$\mathbf{F} = \{F_1, F_2, \ldots, F_k\}$$

where $F_i$ is true when the user is interested in $I_i$ in $I'$ and otherwise false. The set of feature variables $\mathbf{F}$ is used to describe the personal interests of each user in the network.

The chance that a user belongs to the drug community ($D$) given his/her interests is given by the conditional probability $P(D \mid \mathbf{F})$. Given the assumption that each feature variable $F_i$ is conditionally independent of $F_j$ when $i \neq j$, i.e., $P(F_i \mid D, F_j) = P(F_i \mid D)$, this probability can be expressed as

$$P(D \mid \mathbf{F}) = \frac{P(D) \prod_{i=1}^{k} P(F_i \mid D)}{P(\mathbf{F})} \qquad (2)$$

Similarly, the chance of not being a member of the drug community given the user's interests is

$$P(\neg D \mid \mathbf{F}) = \frac{P(\neg D) \prod_{i=1}^{k} P(F_i \mid \neg D)}{P(\mathbf{F})} \qquad (3)$$

Fig. 1 The summed weights of the blog entries of each user in the LiveJournal data set. The higher the summed weight, the more the user used the words and phrases present in the drug-dictionary (see Section 3.2). Users are considered to be a member of the on-line drug community when their weighted sum crosses the threshold of 8



By applying the log-likelihood ratio method, i.e., dividing eq. (2) by eq. (3) and taking the natural logarithm of both sides, we find that the inequality $P(D \mid \mathbf{F}) > P(\neg D \mid \mathbf{F})$, i.e., the user is more likely to belong to the drug community given the user's interests, is equivalent to the inequality:

$$\log \frac{P(D)}{P(\neg D)} + \sum_{i=1}^{k} \log \frac{P(F_i \mid D)}{P(F_i \mid \neg D)} > 0 \qquad (4)$$


A user is considered to be susceptible when he/she does not belong to the drug community and this inequality holds. Users that are neither a member of the on-line drug community nor considered to be susceptible are immune.
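The decision rule of inequality (4) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the probabilities are estimated from labelled profiles with add-alpha (Laplace) smoothing, which the text does not mention, and the function names and data layout are hypothetical. Each feature contributes through its observed value, so interests that are absent from a profile enter the sum as well. The sketch also assumes both groups are non-empty.

```python
from math import log


def train(profiles, labels, interests, alpha=1.0):
    """Estimate the prior log-odds and the per-interest likelihoods.

    profiles: list of sets of interests; labels: True for drug community.
    alpha: smoothing constant (an assumption, not taken from the paper).
    """
    n_d = sum(1 for y in labels if y)
    n_r = len(labels) - n_d
    prior = log(n_d / n_r)  # log P(D) / P(not D)
    likelihoods = {}
    for interest in interests:
        in_d = sum(1 for p, y in zip(profiles, labels) if y and interest in p)
        in_r = sum(1 for p, y in zip(profiles, labels) if not y and interest in p)
        p_d = (in_d + alpha) / (n_d + 2 * alpha)  # P(F_i = true | D)
        p_r = (in_r + alpha) / (n_r + 2 * alpha)  # P(F_i = true | not D)
        likelihoods[interest] = (p_d, p_r)
    return prior, likelihoods


def log_likelihood_ratio(profile, prior, likelihoods):
    """Left-hand side of inequality (4) for one user's set of interests."""
    score = prior
    for interest, (p_d, p_r) in likelihoods.items():
        if interest in profile:
            score += log(p_d / p_r)
        else:
            score += log((1 - p_d) / (1 - p_r))
    return score
```

Under this sketch, a user outside the on-line drug community would be labelled susceptible when `log_likelihood_ratio(...) > 0` and immune otherwise.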

4 Results 

In order to identify those users in the network involved with 
illicit drug use, we overlaid their last 25 blog entries with a 
dictionary of known drug-related terminology (see Section 
3.2). Fig. 1 depicts the distribution of the weights assigned to 
the randomly crawled LiveJournal users. Note that the majority of users appear to make use of only a small number of drug-related terms. The fluctuations that can be seen around the weights 5, 10 and (less distinctly) 15 and 20 can be explained by the weights assigned to the words present in the drug-dictionary (5 for official, clearly drug-related, terminology and 1 for (ambiguous) 'slang' expressions). The users with the highest weights are assumed to be the ones most interested and/or involved in illicit drug use. The threshold was set to 8 (see Fig. 1), i.e., when the weight of a user crosses 8, he/she is considered to be a member of the on-line drug community. Other thresholds close to 8 were considered as well; we found that the themes presented in Section 4.2 did not change substantially. By setting the threshold to 8, approximately 20% of the total set of crawled users were classified as being a member of the on-line drug community.



4.1 Characteristics of the SNS LiveJournal

Fig. 2a depicts the age distribution of the LiveJournal data 
set split out between the on-line drug community and the 
susceptible and immune user groups. Note that this SNS is 
especially popular among 20- to 40-year-old individuals. Figure 2b depicts the age distribution of the Russian Federation as determined on the 1st of January 2011. The data was made available by Rosstat 10. The major dip around the ages 62-70 is a reflection of the impact that the Second World War had on the Russian population.

Note the difference between the Russian LiveJournal com- 
munity and the Russian population as a whole. Using Live- 
Journal to sample the Russian population poses two prob- 
lems: 1) one only samples those individuals who are regis- 
tered as a user in this SNS, and 2) we seriously oversam- 
ple the age group 20-40. Both aspects might not pose a real 
threat; the Russian drug community is, as mentioned before, difficult (or even impossible) to sample directly, making sampling an SNS one of the limited options one has when one wants to gain a better insight into this sub-community.
In addition, illicit drug use is known to occur especially in 
this particular age group (Mityagin, 2012). The strong pres- 
ence of this group, thus, might help in gathering more infor- 
mation on the community of interest. 

Of the total number of 98602 users studied in the Live- 
Journal data set, 16553 and 3586 were identified as, respec- 
tively, members of the drug community and susceptible users. 
Susceptible users are identified using the naive Bayesian classifier described in Section 3.4, which makes use of the interests the user posted on his/her profile page. Common interests can be shown to be rare. In fact, the frequency with which an interest is mentioned by users of this SNS can be shown to follow a power-law distribution with coefficient $\gamma \approx 1.54$, see Appendix 1. With a low number of common interests, there is often not enough to go on in order to reliably classify a user as being susceptible, which explains the relatively small number of susceptible users found.



4.2 Drug Indicators 

After applying Fisher's exact test and Benjamini and Hochberg's false discovery rate control procedure with a q-value of 5% (see Section 3.3), we found 268 of the 3282 initial interests to be significant, i.e., the on-line drug community is significantly more or less interested in these topics than the rest of the LiveJournal users. In order to assess to what extent an interest $I$ is indicative for being a member of the drug community ($D$) or the rest of the population, we use the conditional probability $P(D \mid I)$. Among the interests most indicative for the on-line drug community (i.e., $P(D \mid I) > .5$), we found interests such as: the White Movement (a loose confederation of anti-communist forces who fought the Bolsheviks in the Russian civil war; now often associated with the Russian nationalistic movement), humanistic psychology, partisans, Aryan (ancient people that partly inhabited current Russian territory), Stalinism, Dadaism, narcology and Magadan (a city in the far east of the Russian territory, famous for its large jail). Among interests most indicative for not belonging to the drug community (i.e., $P(D \mid I) < .5$), we found interests such as: accessories, beads, jewellery, London, clothing, glamour, handmade, shoes, beach and interior design.

10 The governmental statistics agency of the Russian Federation. They can be found at http://www.gks.ru (in Russian) with links to their rather extensive database.

Fig. 3 The percentage of users within the on-line drug community and the rest of the on-line population interested in each theme (see Table 3). Users are considered to be interested in a theme when they mention at least one of the interests contained in that theme. Note that the last three themes are more likely to be found in the non-drug section of the network. The other themes are relatively more likely to appear in the on-line drug community.
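In terms of the cell counts of Table 2, and assuming that the conditional probability is estimated empirically from the crawled sample (the text does not spell out the estimator), the indicator for an interest $I_i$ can be written as:

```latex
P(D \mid I_i) \approx \frac{a}{a + b},
\qquad
P(\neg D \mid I_i) \approx \frac{b}{a + b} = 1 - P(D \mid I_i)
```

Under this reading, $P(D \mid I_i) > .5$ simply means that more than half of the users who list $I_i$ on their profile page belong to the on-line drug community.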

In order to get a better view of the wide variety of
significant indicative interests, we clustered them using the
clustering algorithm described in Section 3.3. We found 42
different themes in total. In this section we will only discuss
the ones most prominent within the on-line drug community and
the rest of the LiveJournal population.

Fig. 3 shows the various themes and to what extent they
appear in the on-line drug community and the rest of the
LiveJournal population. We consider users to be interested
in a theme when they mention at least one of the interests
contained in that theme on their profile page.

The names assigned to each theme were chosen by the authors
of this article. In order to mitigate some of the inevitable
subjectivity inherent in this process, we briefly describe the
themes in Table 3, where the second column denotes the number
of significant interests in each theme. When the number of
interests in a theme is small, we list all the interests
(translated to English); otherwise we give only a short
description.

In both Fig. 3 and Table 3 the last three themes (Mainstream
music, Accessories/Clothing and Glamour) appear more often in
the non-drug section of the network. The others are more common
in the on-line drug community.

Recall that significant interests were clustered solely on 
the basis of their cosine similarity (i.e., the more users that 
expressed their interest in both topics, the higher the 'simi- 
larity'). In which of the two distinct communities the interest 
is more prominent is not taken into account. Each theme is, 
thus, likely to contain interests that are more common for 
the on-line drug community and interests that are more of- 
ten found in the rest of the network. To what extent a theme 
can be related to one of these two groups can, therefore, be 
expected to be less clear than for individual interests. 
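
As an illustration of the similarity measure recalled above, the following minimal sketch assumes that each interest is represented by the set of users who list it, i.e., by a binary user vector. The function name and the example data are hypothetical.

```python
from math import sqrt


def cosine_similarity(users_a, users_b):
    """Cosine similarity between two interests given as sets of user ids.

    For binary vectors the dot product is the number of users listing
    both interests, and each norm is the square root of the set size.
    """
    if not users_a or not users_b:
        return 0.0
    return len(users_a & users_b) / sqrt(len(users_a) * len(users_b))


# Hypothetical example: two interests sharing two of their four users.
yoga = {"u1", "u2", "u3", "u4"}
buddhism = {"u3", "u4", "u5", "u6"}
similarity = cosine_similarity(yoga, buddhism)
```

Note that the measure uses only co-occurrence: nothing in it records whether the shared users belong to the drug community, which is why a theme can mix interests typical of both groups.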

4.3 Network Structure Analysis 

In this section we will explore the structure of the on-line
drug community and of the susceptible and immune subnetworks.

Fig. 4a shows the degree distribution of the total crawled
LiveJournal network. Degree is defined here as the total number
of a user's followers and followed 'friends'. Note that the
number of users drops off steeply with increasing degree while
the distribution retains a long right tail; an indication that
the distribution might follow a power-law:

$$p(x) = C x^{-\gamma} \qquad (5)$$

where $x$ is the degree of a user, $\gamma$ is the power-law
exponent and $C$ is a constant. Power-law distributions appear
in a wide variety of natural and man-made processes, e.g., the
number of inhabitants in cities, the diameter of moon craters
and the intensity of solar flares. The widespread appearance
of the power-law raises the question whether the same process
might underlie these (at first glance) different phenomena,
causing quite a discussion in the literature. For a more
elaborate discussion of power-laws and their appearance, we
refer the reader to a recent paper by Pinto et al. (2012).

Fig. 4b shows the rank/frequency log-log plot (see footnote 11)
of the degree distribution in Fig. 4a. Note that the points in
this plot lie (approximately) on a straight line, which is a
characteristic of power-law distributions.
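
A minimal sketch of how the points of such a plot can be obtained, assuming the construction described by Newman (2005), in which each observed degree is paired with the fraction of users whose degree is at least that large. The function name is hypothetical.

```python
from collections import Counter


def rank_frequency_points(values):
    """Return (x, fraction of observations >= x) for each distinct x."""
    counts = Counter(values)
    n = len(values)
    remaining = n  # number of observations >= the current x
    points = []
    for x in sorted(counts):
        points.append((x, remaining / n))
        remaining -= counts[x]
    return points
```

Plotting these points on logarithmically scaled axes gives an approximately straight line when the tail of the data follows a power-law.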

Very few real-world networks display a power-law distribution
over the entire degree range, making it necessary to determine
where the degree distribution is most likely to start following
a power-law (denoted here by $x_{\min}$). The power-law exponent
$\gamma$ and $x_{\min}$ were determined using the maximum
likelihood method as described in the paper by
11 A rank/frequency log-log plot is the plot of the occurrence
frequency versus the rank on logarithmically scaled axes. For a
more elaborate description of how to construct such a plot, see
Newman (2005), Appendix A.



Clauset et al. (2009) and were found to be equal to 1.54
and 8, respectively. The fit is shown in Fig. 4b as a dashed
line. Note that the line seems to fit the data quite well. The
standard statistical test for the quality of fit proposed by
Clauset et al. (2009) gives no reason to believe that the
degree distribution does not follow a power-law (i.e., p = .57
with 1000 repetitions).
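
For a given $x_{\min}$, the maximum likelihood estimate of the exponent has a closed form. The sketch below uses the estimator $\hat{\gamma} = 1 + n \left[ \sum_i \ln (x_i / x_{\min}) \right]^{-1}$ and, for integer data such as degrees, the common approximation that replaces $x_{\min}$ by $x_{\min} - 1/2$. It is a minimal sketch under these assumptions, not the exact code used for the reported fits; the function name is hypothetical.

```python
from math import log


def fit_exponent(values, x_min, discrete=True):
    """Maximum likelihood estimate of the exponent for the tail x >= x_min."""
    tail = [x for x in values if x >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    # For integer data, shifting x_min by 1/2 approximates the
    # discrete estimator; for continuous data no shift is applied.
    shift = 0.5 if discrete else 0.0
    log_sum = sum(log(x / (x_min - shift)) for x in tail)
    if log_sum == 0:
        raise ValueError("all tail observations equal x_min")
    return 1.0 + len(tail) / log_sum
```

In the approach of Clauset et al. (2009), $x_{\min}$ itself is chosen as the value that minimizes the Kolmogorov-Smirnov distance between the data and the fitted distribution; that search is omitted here.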

Fig. 5 shows the rank/frequency log-log plots of the degree
distributions of the on-line drug community and of the
susceptible and immune subnetworks, together with their
power-law fits. Note that these subnetworks also follow a
power-law distribution, only with slightly different values of
$\gamma$.

Table 4 shows various characteristics of the LiveJournal
network and its three subnetworks. Standard deviations are
reported in parentheses. Note that the mean age does not differ
much. The large differences between the maximum degrees of
these networks are common for heavy right-tailed distributions.
The best fits for $\gamma$ and $x_{\min}$ and the p-value of the
goodness-of-fit test are reported as well.

5 Conclusions/Discussion 

Drug abuse has seen a dramatic increase in the Russian
Federation during the last two decades (Sunami, 2007; Mityagin,
2012). The rapid spread and extent of this 'drug epidemic' is a
serious cause for alarm, and finding effective ways to halt the
current trend is of utmost importance.

Due to the criminal nature of narcotics and the general social
disapproval of them, it is difficult (or outright impossible)
to assess the drug community directly. Official governmental
statistics do provide some insight, but fail to give the
complete picture; the 'moderate' drug user is hardly noticed.
Information retrieved from social networks such as LiveJournal
can, therefore, contribute to a better understanding of what
constitutes the drug community in the Russian Federation and
might prove to be vital for devising more effective
intervention strategies.

In this paper we present a method to assess this community,
which cannot be observed directly, by mining the popular social
network site LiveJournal. By comparing the users' blogs
with a dictionary consisting of known drug-related Russian 
terminology, we were able to identify those users that write 
most actively about drug use. By collecting their interests, 
we were able to create a general picture of the kind of users 
that can be found within the on-line drug community, see Ta- 
ble 3 and Fig. 3. In addition, we introduced a naive Bayesian 
classifier for identifying potentially susceptible users by com- 
paring their personal interests with the interests most com- 
mon within the on-line drug community. The 'infectious', 
'susceptible' and 'immune' subnetworks were shown to have
a similar structure; their degree distributions follow a
power-law, although with slightly varying exponents.

Table 3 Description of the most prominent themes

| Theme | # | Description |
|---|---|---|
| Social sciences | 5 | Sociology, history, economics, psychology and law. |
| Exact sciences | 7 | Programming, biology, astronomy, medicine, archeology, ecology and philosophy. |
| Literature | 9 | Containing rather general interests such as books, journalism, poetry and prose. |
| Politics | 22 | This theme contains various national (opposition, corruption and Russia), international (Chechnya, NATO, Poland and Ukraine) and general (socialism, democracy and anti-communism) political topics. |
| Occult | 15 | Concerns a wide variety of topics, including, for example, the occult, non-traditional medicine, mysticism, clairvoyance, telepathy and the prediction of the future through the reading of cards (tarot). |
| Science fiction | 8 | Containing interests like UFOs, futurology, nanotechnology, science fiction and the American science fiction writer H. Harrison. |
| Russian history | 11 | Ranging from general sciences (anthropology, ethnography, war history) to particular events in the history of Russia (WWII, the Russian civil war) and important historical groups (partisans). |
| Christianity | 3 | God, the Russian orthodox church and religion. |
| Esotericism | 7 | Contains various topics related to esotericism (esotericism itself, but also the expansion and altering of the human mind) and Castaneda, a rather famous author who popularized topics such as 'stalking' (technique to control the mind) and lucid dreams. |
| Eastern teachings | 10 | Various eastern teachings/religions (Buddhism, Zen and yoga) and related terms (e.g., mantras, chakras and tantras). |
| Singer-songwriters | 6 | Interests related to Russian rock and singer-songwriters (e.g., V. Vysotsky). |
| Outdoor activities | 8 | Diving, fishing, hunting and topics related to mountain climbing (e.g., alpinism and the Altai mountains) and survival. |
| Nationalism | 9 | Covering interests such as the Russian empire, patriotism, the Russian people, the White Movement and antiglobalization. |
| Psychiatry | 6 | Including psychiatry, psychoanalysis, psychotherapy, psychosomatic medicine and transpersonal and humanistic psychology. |
| Mainstream music | 6 | Containing several famous mainstream musicians, such as Madonna, Coldplay and Björk. |
| Accessories/Clothing | 13 | Varying from accessories like beads, jewelry, shoes and bags to clothing and interior design. |
| Glamour | 13 | Includes the interest glamour itself. It further covers fashion (e.g., journals, style, jeans, design and shopping) and the night-life of Moscow. |

Fig. 4 b The rank/frequency log-log plot of the degree distribution and the power-law fit depicted as a dashed line ($\gamma \approx 1.54$ and $x_{\min} = 8$). The
p-value was found to be approximately .57, i.e., there is no reason to believe that the degree distribution does not follow a power-law



It is unclear to what extent we were able to identify the
users that are really involved in drug use. Users that tend
to write often about narcotics might do so for the following
three reasons: 1) to stimulate discussion of the social
problems caused by drug abuse or to propose possible ways to
change the current situation for the better, 2) in an attempt
to persuade others to stop or never start using drugs, i.e.,
'anti-propaganda', or 3) to share their experiences with drugs
or to express their interest in this topic. We are solely
interested in the group of users writing about narcotics for
the third reason; they are the ones that use drugs or are
likely to do so in the future.

The appearance of the theme politics in Fig. 3 might be
best explained by the presence of users in LiveJournal that
do not write about drugs because they are personally interested
in them or using them, but rather because they want to draw
attention to the social problems related to narcotics. The same
might hold for themes such as the social and exact sciences,
psychiatry and, potentially, nationalism. The presence of a
theme like Christianity (consisting of the interests 'God',
'the Russian orthodox church' and 'religion') is more likely to
be explained by the presence of users that spread
anti-propaganda, especially when taking the negative stance of
the church towards drugs into account.

Fig. 5 The rank/frequency log-log plots of the degree distributions of the three subnetworks in the crawled LiveJournal network: the drug community, the susceptible subnetwork and the immune subnetwork.

Table 4 Structural characteristics of the various subnetworks in LiveJournal

| Network | Size | Edges | Age, mean (SD) | Max. degree | $\gamma$ | $x_{\min}$ | p-value |
|---|---|---|---|---|---|---|---|
| Drug community | 16553 | 61021 | 32.08 (9.20) | 160 | 1.57 | 19 | .97 |
| Susceptible | 3586 | 16499 | 32.14 (8.75) | 72 | 1.66 | 8 | .84 |
| Immune | 78463 | 496018 | 30.31 (8.03) | 323 | 1.51 | 10 | .76 |
| Total | 98602 | 982197 | 30.71 (8.32) | 524 | 1.54 | 8 | .57 |

Themes such as the occult, esotericism, science fiction
and eastern teachings, however, can hardly be explained by
assuming that the users interested in these topics are heavily
concerned with the social impact of drug abuse, or are actively
spreading anti-propaganda. Most likely, we caught a glimpse of
the actual drug community.

The explanations of why certain themes are present in the
on-line drug community are, of course, based solely on the view
of the authors and, therefore, subjective. Further research is
required to establish which themes are truly related to the
Russian drug community. In order to establish the validity of
the approach described in this paper, one might compare the
presented results with law enforcement data, e.g., it would be
interesting to compare the number of convictions for
drug-related crimes between the on-line drug community and the
rest of the crawled LiveJournal population.

The susceptibility of an individual to drugs was determined
on the basis of the similarity between his/her personal
interests and the interests common in the on-line drug
community. We limited ourselves here to their interests, since
it was unclear how to relate the susceptibility of a user to
his/her texts.
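
The idea of scoring a user by his/her interests can be sketched with a small naive Bayesian classifier. This is a minimal sketch under stated assumptions, not the classifier presented earlier in the paper: it assumes two classes, Laplace smoothing with a hypothetical parameter `alpha`, and a score that uses only the interests a user actually lists. All names and the toy data are hypothetical.

```python
from math import log


def train(profiles, labels, vocabulary, alpha=1.0):
    """Estimate log-priors and smoothed log-likelihoods for each class."""
    model = {}
    for cls in set(labels):
        docs = [p for p, y in zip(profiles, labels) if y == cls]
        n = len(docs)
        log_prior = log(n / len(profiles))
        # Smoothed probability that a user of this class lists interest i.
        log_like = {
            i: log((sum(i in p for p in docs) + alpha) / (n + 2 * alpha))
            for i in vocabulary
        }
        model[cls] = (log_prior, log_like)
    return model


def log_scores(model, interests):
    """Log joint score per class, using only the interests that are listed."""
    return {
        cls: log_prior + sum(log_like[i] for i in interests if i in log_like)
        for cls, (log_prior, log_like) in model.items()
    }


def classify(model, interests):
    """Return the class with the highest score for a set of interests."""
    scores = log_scores(model, interests)
    return max(scores, key=scores.get)


# Hypothetical toy data: interest sets with a community label.
profiles = [{"yoga", "occult"}, {"occult", "ufo"},
            {"shoes", "glamour"}, {"glamour", "london"}, {"shoes"}]
labels = ["drug", "drug", "rest", "rest", "rest"]
vocabulary = set().union(*profiles)
model = train(profiles, labels, vocabulary)
```

Under this sketch, a user outside the on-line drug community whose interests are nevertheless assigned to the "drug" class would be the analogue of a susceptible user.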

The number of susceptible users is relatively small due
to the small number of common interests present in the
LiveJournal network. In fact, it can be shown that the
frequency with which a certain interest occurs follows a
power-law with exponent $\gamma \approx 1.54$, see Appendix 1.
With a low number of common interests, there is often not
enough to go on to identify a user as being susceptible. It is,
thus, very well possible that we classified as immune several
users who should have been identified as susceptible.

Users were considered to be members of the on-line drug
community when the weighted sum of their blog entries crossed
the threshold of 8, see Fig. 1. We experimented with different
thresholds and found that, although the list of significant
interests does vary, the resulting clusters/themes remain
stable. The weights assigned to the official and
informal/'slang' terminology in the drug-dictionary were not
varied. Since the final themes did not vary much while varying
the threshold, it is unlikely that they would change much if
the weights were varied.

As mentioned before, we found that the LiveJournal network
and the infectious, susceptible and immune subnetworks are most
likely scale-free (i.e., their degree distributions follow a
power-law). Although the performed goodness-of-fit test
(Clauset et al., 2009) does not exclude other possibilities,
e.g., Poisson, we can state with certainty that the
distributions are heavy right-tailed, which entails that the
network has hubs, i.e., users with a far higher degree than the
rest of the network. This knowledge might be of major
importance when one wants to disrupt the network to, for
example, limit the spread of drug-related information on the
network. Removing the hubs would heavily disrupt the
information flow (Bollobás and Riordan, 2004; Albert et al.,
2000; Crucitti et al., 2003).
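
The effect of removing hubs can be illustrated with a minimal sketch that measures the size of the largest connected component before and after the highest-degree users are taken out. It assumes an undirected network stored as an adjacency dictionary; the function names and the toy network are hypothetical, and no such experiment was performed on the crawled data.

```python
from collections import deque


def largest_component_size(adj, removed=frozenset()):
    """Size of the largest connected component, ignoring removed nodes."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        # Breadth-first search over the remaining nodes.
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        best = max(best, size)
    return best


def top_hubs(adj, k):
    """The k nodes with the highest degree."""
    return set(sorted(adj, key=lambda n: len(adj[n]), reverse=True)[:k])


# Hypothetical toy network: user 0 is a hub connected to everyone else.
adj = {0: {1, 2, 3, 4, 5}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: {0}, 5: {0}}
before = largest_component_size(adj)
after_hub = largest_component_size(adj, removed=top_hubs(adj, 1))
after_leaf = largest_component_size(adj, removed={5})
```

In this toy network, removing the single hub fragments the graph far more than removing a low-degree user, which is the qualitative behaviour reported for scale-free networks in the cited literature.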

This paper has shown the promise of 'crawling social
networks' in delineating and analyzing social groups that
hitherto have eluded such research, because of the
fundamentally opaque nature of membership of such groups. The
case in point is the Russian drug community. We hope that
continuing research along the lines we set out in this paper
will help to map the dynamics of this group, and will
ultimately contribute to halting, if not reversing, its tragic
growth.

Appendix 1: LiveJournal User Interests

In this appendix we take a closer look at the frequency with
which interests are expressed by the users of the social
network LiveJournal. Fig. 6a shows the frequency of occurrence
of interests within the crawled population. Note that the
distribution is heavy right-tailed; its slope suggests that the
distribution might follow a power-law, see eq. (5). Fig. 6b
shows the corresponding rank/frequency log-log plot of the
histogram in Fig. 6a. The exponent $\gamma \approx 1.54$ and the
start of the distribution $x_{\min} = 3$ were estimated using
the maximum likelihood method as proposed by Clauset et al.
(2009). Note that the fitted line in Fig. 6b approximates the
distribution quite well. The standard goodness-of-fit test
(Clauset et al., 2009) indicates there is no reason to believe
that the distribution does not follow a power-law, i.e., the
p-value was approximately equal to .57.

The fact that the distribution of interests within the social
network site LiveJournal is heavy right-tailed explains why the
number of susceptible users (see Table 4) is relatively small
compared to the other groups.